Abstract: Research on deep reinforcement learning (DRL), both theoretical and applied, continues to deepen, and DRL now plays an important role in games, robot control, dialogue systems, autonomous driving and other areas. Meanwhile, DRL still suffers from shortcomings such as the exploration-exploitation dilemma, sparse rewards, the difficulty of sample collection and poor model stability, and researchers have proposed various solutions to these problems. New theoretical results have further promoted the development of DRL and opened up several new research directions in reinforcement learning, such as imitation learning, hierarchical reinforcement learning and meta-learning. This paper briefly introduces DRL theory, its difficulties and its applications, and explores and summarizes the future development of DRL.
WAN Lipeng, LAN Xuguang, ZHANG Hanbo, ZHENG Nanning. A Review of Deep Reinforcement Learning Theory and Application. Pattern Recognition and Artificial Intelligence, 2019, 32(1): 67-81.